Abstract. The cornerstone of any algorithm computing all repetitions in a string of
length $n$ in $O(n)$ time is the fact that the number of runs (or maximal repetitions) is
$O(n)$. We give a simple proof of this result. As a consequence of our approach, the
stronger result concerning the linearity of the sum of exponents of all runs follows easily.



1. Introduction 

Repetitions in strings constitute one of the most fundamental areas of string combinatorics,
with very important applications to text algorithms, data compression, and the analysis
of biological sequences. One of the most important problems in this area was finding an
algorithm for computing all repetitions in linear time. A major obstacle was encoding all
repetitions in linear space, because there can be $\Theta(n \log n)$ occurrences of squares in a
string of length $n$ (see [1]). All repetitions are encoded in runs (that is, maximal repetitions),
and Main [9] used the s-factorization of Crochemore [1] to give a linear-time algorithm for
finding all leftmost occurrences of runs. What was essentially missing for a linear-time
algorithm computing all repetitions was a proof that there are at most linearly many
runs in a string. Iliopoulos et al. [3] showed that this property is true for Fibonacci words.
The general result was achieved by Kolpakov and Kucherov [7], who gave a linear-time
algorithm for locating all runs in [6].

Kolpakov and Kucherov proved that the number of runs in a string of length $n$ is at
most $cn$ but could not provide any value for the constant $c$. Recently, Rytter [10] proved
that $c \le 5$. The conjecture in [7] is that $c = 1$ for binary alphabets, as supported by
computations for string lengths up to 31. Using the technique of this note, we have proved
[2] that $c$ is smaller than 1.6, which is the best value so far.
Both proofs in [6] and [10] are very intricate and our contribution is a simple proof
of the linearity. On the one hand, the search for a simple proof is motivated by the
importance of the result: it is the core of the analysis of any optimal algorithm computing
all repetitions in strings. None of the above-mentioned proofs can be included in a textbook.
We believe that the simple proof shows very clearly why the number of runs is linear. On
the other hand, a better understanding of the structure of runs could pave the way for
simpler linear-time algorithms for finding all repetitions. For the algorithm of [6] (and [9]),
relatively complicated and space-consuming data structures are needed, such as suffix trees.

The technical contribution of the paper is based on the notion of $\delta$-close runs (runs
having close centers), which is an improvement on the notion of neighbors (runs having
close starting positions) introduced by Rytter [10].

On top of that, our approach enables us to derive easily the stronger result concerning 
the linearity of the sum of exponents of all runs of a string. Clearly this result implies the 
first one, but the converse is not obvious. The second result was given another long proof 
in [7]; it also follows from [10].

Finally, we strongly believe that our ideas in this paper can be further refined to improve 
significantly the upper bound on the number of runs, if not to prove the conjecture. The 
latest refinements and computations (December 2007) show a 1.084n bound. 

2. Definitions 

Let $A$ be an alphabet and $A^*$ the set of all finite strings over $A$. We denote by $|w|$ the
length of a string $w$, by $w[i]$ its $i$th letter, and by $w[i .. j]$ its factor $w[i]w[i+1] \cdots w[j]$. We
say that $w$ has period $p$ iff $w[i] = w[i+p]$, for all $1 \le i \le |w| - p$. The smallest period of $w$
is called the period of $w$ and the ratio between the length and the period of $w$ is called the
exponent of $w$.
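The two definitions above can be checked mechanically. The following Python sketch is only an illustration: the names `smallest_period` and `exponent` are ours, not the paper's, and Python strings are indexed from 0 whereas the text indexes letters from 1.

```python
from fractions import Fraction


def smallest_period(w: str) -> int:
    """Smallest p >= 1 such that w[i] == w[i + p] for every valid i."""
    n = len(w)
    for p in range(1, n + 1):
        # range(n - p) enumerates exactly the positions i with i + p < n
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    raise ValueError("the empty string has no period")


def exponent(w: str) -> Fraction:
    """Ratio between the length of w and its smallest period."""
    return Fraction(len(w), smallest_period(w))
```

For instance, `exponent("babab")` is the exact fraction 5/2, the exponent 2.5 of the string $(ba)^{2.5}$.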

For a positive integer $n$, the $n$th power of $w$ is defined inductively by $w^1 = w$, $w^n =
w^{n-1}w$. A string is primitive if it cannot be written as a proper integer (two or more) power
of another string. Any nonempty string can be uniquely written as an integer power of a
primitive string, called its primitive root. It can also be uniquely written in the form $u^e v$
where $|u|$ is its (smallest) period, $e$ is the integral part of its exponent, and $v$ is a proper
prefix of $u$.
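As a small illustration of these two factorizations, here is a brute-force Python sketch. The helper names (`primitive_root`, `decompose`) are hypothetical and chosen for this example only; no efficiency is claimed.

```python
def smallest_period(w: str) -> int:
    """Smallest p >= 1 such that w[i] == w[i + p] for every valid i."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    raise ValueError("the empty string has no period")


def primitive_root(w: str) -> str:
    """Shortest u such that w is an integer power of u."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]
    raise ValueError("the empty string has no primitive root")


def decompose(w: str) -> tuple[str, int, str]:
    """Return (u, e, v) with w == u * e + v, |u| the smallest period of w,
    e the integral part of the exponent, and v a proper prefix of u."""
    p = smallest_period(w)
    e = len(w) // p
    return w[:p], e, w[e * p:]
```

With these helpers, `decompose("babab")` gives `("ba", 2, "b")`, that is, $babab = (ba)^2 b$.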

The following well-known synchronization property will be useful: If $w$ is primitive,
then $w$ appears as a factor of $ww$ only as a prefix and as a suffix (not in-between). Another
property we use is Fine and Wilf's periodicity lemma: If $w$ has periods $p$ and $q$ and $|w| \ge
p + q$, then $w$ also has period $\gcd(p, q)$. (This is a bit weaker than the original lemma, which
works as soon as $|w| \ge p + q - \gcd(p, q)$, but it is good enough for our purpose.) We refer
the reader to [8] for all concepts used here.
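Both properties can be sanity-checked by exhaustive search on short strings. The sketch below does this over the binary alphabet {a, b}; it is a finite experiment, not a proof, and all function names are our own.

```python
from itertools import product
from math import gcd


def has_period(w: str, p: int) -> bool:
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def is_primitive(w: str) -> bool:
    """True iff w is nonempty and not a proper integer power of a shorter string."""
    n = len(w)
    return n > 0 and not any(
        n % d == 0 and w[:d] * (n // d) == w for d in range(1, n)
    )


def occurrences(w: str, text: str) -> list[int]:
    """0-based starting positions of w in text."""
    return [i for i in range(len(text) - len(w) + 1) if text.startswith(w, i)]


def binary_strings(max_len: int):
    for n in range(1, max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)


def synchronization_holds(max_len: int) -> bool:
    """A primitive w occurs in ww only as a prefix and as a suffix."""
    return all(
        occurrences(w, w + w) == [0, len(w)]
        for w in binary_strings(max_len)
        if is_primitive(w)
    )


def fine_wilf_holds(max_len: int) -> bool:
    """Periods p and q with |w| >= p + q force the period gcd(p, q)."""
    for w in binary_strings(max_len):
        n = len(w)
        for p in range(1, n):
            for q in range(p, n - p + 1):  # guarantees p + q <= n
                if (has_period(w, p) and has_period(w, q)
                        and not has_period(w, gcd(p, q))):
                    return False
    return True
```

A non-primitive string shows why primitivity is needed: `occurrences("abab", "abababab")` also reports the middle position 2.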

For a string $w = w[1 .. n]$, a run (or maximal repetition) is an interval $[i .. j]$,
$1 \le i < j \le n$, such that (i) the factor $w[i .. j]$ is periodic (its exponent is at least 2) and (ii)
both $w[i-1 .. j]$ and $w[i .. j+1]$, if defined, have a strictly higher (smallest) period. As
an example, consider $w$ = abbababbaba; [3 .. 7] is a run with period 2 and exponent 2.5; we
have $w[3 .. 7]$ = babab = (ba)$^{2.5}$. Other runs are [2 .. 3], [7 .. 8], [8 .. 11], [5 .. 10] and [1 .. 11].
For a run starting at $i$ and having period $|x| = p$, we shall call $w[i .. i+2p-1] = x^2$ the
square of the run (this is the only part of a run we can count on). Note that $x$ is primitive
and the square of a run cannot be extended to the left (with the same period) but may be
extendable to the right. The center of the run is the position $c = i + p$. We shall denote
the beginning of the run by $i_x = i$, the end of its square by $e_x = i_x + 2p - 1$, and its center
by $c_x = i_x + p$.

Footnote. Runs were introduced in [9] under the name maximal periodicities; they are also
called m-repetitions, and runs in [1].
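The example can be reproduced with a direct, quadratic-time enumeration of runs. The sketch below is a brute-force illustration of the definition only; it is not the linear-time algorithm of [6], and the function names are ours. It reports each run as a triple $(i, j, p)$ with 1-based inclusive positions, as in the text.

```python
def smallest_period(w: str) -> int:
    n = len(w)
    for p in range(1, n + 1):
        if all(w[k] == w[k + p] for k in range(n - p)):
            return p
    raise ValueError("the empty string has no period")


def runs(w: str) -> list[tuple[int, int, int]]:
    """All runs of w as (i, j, p): 1-based interval [i .. j] with period p."""
    n = len(w)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0  # 0-based candidate start; w[i - 1] != w[i - 1 + p] whenever i > 0
        while i + p < n:
            k = i
            # extend while position k agrees with position k + p
            while k + p < n and w[k] == w[k + p]:
                k += 1
            # w[i : k + p] is a maximal factor having period p
            if k - i >= p and smallest_period(w[i:k + p]) == p:
                found.append((i + 1, k + p, p))
            i = k + 1  # position k was a mismatch (or the string ended)
    return found
```

On `"abbababbaba"` this returns the six runs listed above, with periods 1, 1, 2, 2, 3 and 5; the center and the end of the square of a run $(i, j, p)$ are then $i + p$ and $i + 2p - 1$.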

3. Linear number of runs 

We describe in this section our proof of the linear number of runs. The idea is to
partition the runs by grouping together those having close centers and similar periods. To
this aim, for any $\delta > 0$, we say that two runs having squares $x^2$ and $y^2$ are $\delta$-close if (i)
$|c_x - c_y| \le \delta$ and (ii) $2\delta \le |x|, |y| \le 3\delta$. We prove that there cannot be more than three
mutually $\delta$-close runs. (There is one exception to this rule, case (vi) below, but then even
fewer runs are obtained.) This means that the number of runs with periods between $2\delta$
and $3\delta$ in a string of length $n$ is at most $3n/\delta$. Summing up for values
$\delta_i = \frac{1}{2}\left(\frac{3}{2}\right)^i$, $i \ge 0$, all periods are considered and we obtain that the number of
runs is at most

$$\sum_{i=0}^{\infty} \frac{3n}{\delta_i} = 6n \sum_{i=0}^{\infty} \left(\frac{2}{3}\right)^{i} = 18n. \qquad (3.1)$$
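The arithmetic behind the summation can be checked with exact rationals. The sketch below assumes the geometric choice $\delta_i = \frac{1}{2}\left(\frac{3}{2}\right)^i$, $i \ge 0$ (an assumption of this example: it is the choice under which the windows $[2\delta_i, 3\delta_i]$ tile all periods and the sum of $3n/\delta_i$ equals $18n$). The function names are ours.

```python
from fractions import Fraction


def delta(i: int) -> Fraction:
    """Assumed scale delta_i = (1/2) * (3/2)**i."""
    return Fraction(1, 2) * Fraction(3, 2) ** i


def window_index(p: int) -> int:
    """Smallest i with p <= 3 * delta(i); for p >= 1 it also has 2 * delta(i) <= p,
    because 2 * delta(i + 1) == 3 * delta(i) and 2 * delta(0) == 1."""
    i = 0
    while 3 * delta(i) < p:
        i += 1
    return i


def partial_bound(terms: int) -> Fraction:
    """Sum of 3 / delta_i over i < terms, i.e. the bound divided by n."""
    return sum(Fraction(3) / delta(i) for i in range(terms))
```

The partial sums equal $18\left(1 - (2/3)^N\right)$, so they increase towards 18 without reaching it.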

For this purpose, we start investigating what happens when three runs in a string $w$ are
$\delta$-close. Let us denote their squares by $x^2, y^2, z^2$, their periods by $|x| = p$, $|y| = q$, $|z| = r$,
and assume $p < q < r$. We discuss below all the ways in which $x^2$ and $y^2$ can be positioned
relative to each other and see that long factors of both runs have small periods with which
$z^2$ has to synchronize. This will restrict the beginning of $z^2$ to only one choice, as otherwise
some run would be left extendable. Then a fourth run $\delta$-close to the previous three cannot
exist.

Notice that, for cases (i)-(v), we assume the centers of the runs are different; the case
when they coincide is covered by (vi).

(i) $(i_y < i_x <)\ c_y < c_x < e_x \le e_y$. Then $x$ and the suffix of length $e_y - c_x + 1$ of $y$ have
period $q - p$; see Fig. 1(i). We may assume that the string corresponding to this period is
primitive, as otherwise we can make the same reasoning with its primitive root.

Since $z^2$ is $\delta$-close to both $x^2$ and $y^2$, it must be that $c_z \in [c_x - \delta .. c_y + \delta]$. Consider
the interval of length $q - p$ that ends at the leftmost possible position for $c_z$, that is,
$I = [c_x - \delta - (q - p) .. c_x - \delta - 1]$. It is included in the first period of $z^2$, that is, $[i_z .. c_z - 1]$,
and in $[i_x .. c_y]$. Thus $w[I]$ is primitive and equal, due to $z^2$, to $w[I + r]$, which is a factor of
$w[c_x .. e_y]$. Therefore, the periods inside the former must synchronize with the ones in the
latter. It follows, in the case $i_z > i_x - (q - p)$, that $w[i_z - 1] = w[c_z - 1]$, that is, $z^2$ is left
extendable, a contradiction. If $i_z < i_x - (q - p)$, then $w[c_x - 1] = w[i_x - (q - p) - 1] = w[i_x - 1]$,
that is, $x^2$ is left extendable, a contradiction. The only possibility is that $i_z = i_x - (q - p)$ and
$r$ equals $q$ plus a multiple of $q - p$. Here is an example: $w$ = baabababaababababaab, $x^2 =
w[5 .. 14]$ = (ababa)$^2$, $y^2 = w[1 .. 14]$ = (baababa)$^2$, and $z^2 = w[3 .. 20]$ = (abababaab)$^2$.
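The numbers in this example are easy to verify by machine. The following Python sketch checks the three squares and the two conclusions of the argument; the helper names and the dictionary of checks are ours and serve only this illustration. Positions are 1-based and inclusive, as in the text.

```python
W = "baabababaababababaab"


def factor(w: str, i: int, j: int) -> str:
    """w[i .. j] with 1-based inclusive positions."""
    return w[i - 1:j]


def has_period(s: str, p: int) -> bool:
    return all(s[k] == s[k + p] for k in range(len(s) - p))


def left_extendable(w: str, i: int, p: int) -> bool:
    """True iff w[i - 1] == w[i - 1 + p] (1-based), so that the square
    starting at i with period p extends one letter to the left."""
    return i > 1 and w[i - 2] == w[i - 2 + p]


def example_checks() -> dict[str, bool]:
    (i_x, p), (i_y, q), (i_z, r) = (5, 5), (1, 7), (3, 9)
    return {
        "x_square": factor(W, 5, 14) == "ababa" * 2,
        "y_square": factor(W, 1, 14) == "baababa" * 2,
        "z_square": factor(W, 3, 20) == "abababaab" * 2,
        "x_has_period_q_minus_p": has_period("ababa", q - p),
        "z_start": i_z == i_x - (q - p),
        "r_is_q_plus_multiple": (r - q) % (q - p) == 0,
        "none_left_extendable": not any(
            left_extendable(W, i, per) for i, per in ((i_x, p), (i_y, q), (i_z, r))
        ),
    }
```

All seven checks hold for this string, with $p = 5$, $q = 7$, $r = 9$ and $q - p = 2$.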

We have already, due to $z^2$, that $x = \rho^{k}\rho'$, where $|\rho| = q - p$ and $\rho'$ is a prefix of $\rho$. A
fourth run $\delta$-close to the previous three would have to have the same beginning as $z^2$, and
the length of its period would also have to be $q$ plus a multiple of $q - p$. This would imply
an equation of the form $\rho^{m}\rho' = \rho'\rho^{m}$, and then $\rho$ and $\rho'$ are powers of the same string, a
contradiction with the primitivity of $x$.

(ii) $(i_y < i_x <)\ c_y < c_x < e_y < e_x$; this is similar to (i); see Fig. 1(ii). Here the prefix
of length $e_y - c_x + 1$ of $x$ is a suffix of $y$ and has period $q - p$.
Figure 1: Relative position of $x^2$ and $y^2$ in cases (i)-(vi).



(iii) $i_y < i_x < c_x < c_y\ (< e_x < e_y)$. Here $x$ and the prefix of length $c_x - i_y$ of $y$ have
period $q - p$; see Fig. 1(iii). As above, a third $\delta$-close run $z^2$ would have to share the same
beginning with $y^2$, otherwise one of the runs would be left extendable. A fourth $\delta$-close run
would have to start at the same place and, because of the three-prefix-square lemma of [3],
since $\rho$ is primitive, it would have a period at least $q + r$, which is impossible: $q + r \ge 4\delta$,
whereas the periods of $\delta$-close runs are at most $3\delta$.

(iv) $i_x < i_y\ (< c_x < c_y < e_x < e_y)$; this is similar to (iii); see Fig. 1(iv). A third run
would begin at the same position as $x^2$ and there is no fourth run.



(v) $i_x = i_y\ (< c_x < c_y < e_x < e_y)$; see Fig. 1(v). Here not even a third $\delta$-close run exists
because of the three-square lemma, which implies $r \ge p + q$.



(vi) $c_x = c_y$. This case is significantly different from the other ones, as we can have
many $\delta$-close runs here. However, the existence of many runs with the same center implies
very strong periodicity properties of the string, which allow us to count the runs globally
and obtain even fewer runs than before.

In this case both $x$ and $y$ have the same small period $\ell = q - p$; see Fig. 1(vi). If we write
$c = c_x = c_y$ then we have $h$ runs $x_j^{\alpha_j}$, $1 \le j \le h$, beginning at positions $i_{x_j} = c - ((j - 1)\ell + \ell')$,
where $\ell'$ is the length of the suffix of $x$ that is a prefix of the period.

We show that in this case we have fewer runs than counted in the sum (3.1). For $h \le 9$
there is nothing to prove, as no four of our runs $x_j^{\alpha_j}$ are counted for the same $\delta$. Assume
$h \ge 10$. There exists $\delta_i$ such that $\frac{\ell}{2} \le \delta_i \le \frac{3\ell}{4}$, that is, this $\delta_i$ is considered in (3.1). Then
it is not difficult to see that there is no run in $w$ with period between $\ell$ and $\frac{9}{4}\ell$ and center
inside $J = [c + \ell + 1 .. c + (h - 2)\ell + \ell']$. But $\ell \le 2\delta_i \le 3\delta_i \le \frac{9}{4}\ell$ and the length of $J$ is
$(h - 3)\ell + \ell' \ge (h + 1)\delta_i$. This means that at least $h$ intervals of length $\delta_i$ in the sum (3.1)
are covered by $J$ and therefore at least $3h$ runs in (3.1) are replaced by our $h$ runs.

Footnote. The three-prefix-square lemma used in case (iii) states that, for three words $u$, $v$, $w$,
if $uu$ is a prefix of $vv$, $vv$ is a prefix of $ww$, and $u$ is primitive, then $|u| + |v| \le |w|$.

We also need to mention that these $h$ intervals of length $\delta_i$ are not reused by a different
center with multiple runs, since such centers cannot be close to each other. Indeed, if we
have two centers $c_j$ with the above parameters $h_j, \ell_j$, $j = 1, 2$, then, as soon as the longest
runs overlap over $\ell_1 + \ell_2$ positions, we have $\ell_1 = \ell_2$, due to Fine and Wilf's lemma. Then,
the closest positions of $J_1$ and $J_2$ cannot be closer than $\ell_1 = \ell_2 \ge \delta_i$, as this would make
some of the runs non-primitive, a contradiction. Thus the bound in (3.1) still holds and we
proved

Theorem 3.1. The number of runs in a string of length $n$ is $O(n)$.

4. The sum of exponents 

Using the above approach, we show in this section that the sum of exponents of all
runs is also linear. The idea is to prove that the sum of exponents of all runs with
centers in an interval of length $\delta$ and periods between $2\delta$ and $3\delta$ is less than 8. (As in the
previous proof, there are exceptions to this rule, but in those cases we get a smaller sum
of exponents.) Then a computation similar to (3.1) gives that the sum of exponents is at
most $48n$.
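Spelled out, the computation runs as follows. This worked equation assumes the same geometric scales $\delta_i = \frac{1}{2}\left(\frac{3}{2}\right)^i$, $i \ge 0$, as in Section 3 (an assumption of this sketch: it is the choice under which the stated constant 48 comes out), together with $n/\delta_i$ intervals of length $\delta_i$, each contributing less than 8.

```latex
\sum_{i=0}^{\infty} \frac{8n}{\delta_i}
  = \sum_{i=0}^{\infty} \frac{8n}{\frac{1}{2}\left(\frac{3}{2}\right)^{i}}
  = 16n \sum_{i=0}^{\infty} \left(\frac{2}{3}\right)^{i}
  = 16n \cdot \frac{1}{1 - \frac{2}{3}}
  = 48n .
```

The only change with respect to the count of runs is the constant per interval (8 instead of 3), which scales the bound from $18n$ to $48n$.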

To start with, Fine and Wilf's periodicity lemma can be rephrased as follows: For two
primitive strings $x$ and $y$, any powers $x^{\alpha}$ and $y^{\beta}$ cannot have a common factor longer than
$|x| + |y|$, as such a factor would also have period $\gcd(|x|, |y|)$, contradicting the primitivity
of $x$ and $y$.

Next consider two $\delta$-close runs, $x^{\alpha}$ and $y^{\beta}$, $\alpha, \beta \in \mathbb{Q}$. It cannot be that both $\alpha$ and $\beta$
are 2.5 or larger, as this would imply an overlap of length at least $|x| + |y|$ between the two
runs, which is forbidden by Fine and Wilf's lemma since $x$ and $y$ are primitive. Therefore,
in case we have three mutually $\delta$-close runs, two of them must have their exponents smaller
than 2.5. If the exponent of the third run is less than 3, we obtain the total of 8 we were
looking for. However, the third run, say $z^{\gamma}$, $\gamma \in \mathbb{Q}$, may have a larger exponent. If it does,
that affects the runs in the neighboring intervals of length $\delta$. More precisely, if $\gamma \ge 3$, then
there cannot be any center of a run with period between $2\delta$ and $3\delta$ in the next (to the right)
interval of length $\delta$. Indeed, the overlap between any such run and $z^{\gamma}$ would imply, as above,
that their roots are not primitive, a contradiction. In general, the following $\lfloor 2(\gamma - 2.5) \rfloor$
intervals of length $\delta$ cannot contain any center of such runs. Thus, we obtain a smaller sum
of exponents when this situation is met.

The second exception is given by case (vi) in the previous proof, that is, when many
runs share the same center; we use the same notation as in (vi). We need to be aware of the
exponent of the run $x_1^{\alpha_1}$, with the smallest period, as $\alpha_1$ can be as large as $\ell$ (and unrelated
to $h$, the number of runs with the same center). We shall count $\alpha_1$ into the appropriate
interval of length $\delta_i$; notice that $x_1^{\alpha_1}$ and $x_2^{\alpha_2}$ are never $\delta$-close, for any $\delta$, because $|x_2| > 2|x_1|$.
For $2 \le j \le h - 1$, the period $|x_j|$ cannot be extended by more than $\ell$ positions to the right
past the end of the initial square, and thus $\alpha_j \le 2 + \ell/|x_j| < 3$. Therefore, their contribution to the
sum of exponents is less than $3(h - 2)$. They replace the exponents of the runs with centers
in the interval $J$ and periods between $\ell$ and $\frac{9}{4}\ell$, which otherwise would contribute at least
$6h$ to the sum of exponents. The run with the longest period, $x_h^{\alpha_h}$, can have an arbitrarily
high exponent, but the replaced runs in $J$ need to account only for a fraction (3 units) of it,
since $\alpha_h \ge 3$ implies new centers with multiple runs and hence new $J$ intervals (precisely
$\lfloor \alpha_h - 2 \rfloor$) that account for the rest. We proved

Theorem 4.1. The sum of exponents of the runs in a string of length $n$ is $O(n)$.